library(limma)
library(edgeR)
library(AnnotationHub)
library(magrittr)
library(scales)
library(tidyverse)
library(pander)
theme_set(theme_bw())
The alignments in this analysis were generated by aligning each library (including technical replicates) to the Zebrafish genome GRCz11 (Ensembl Release 94) using (STAR v2.5.3a). Reads were assigned to each genes based on 100% exonic overlap and unique genomic alignments, based on gene descriptions in Danio_rerio.GRCz11.94.chr.gtf, also obtained from Ensembl Release 94. This set of gene descriptions were then loaded into R as an EnsDb object using the AnnotationHub() infrastructure.
ah <- AnnotationHub() %>%
subset(species == "Danio rerio") %>%
subset(rdataclass == "EnsDb")
ensDb <- ah[["AH64906"]]
genes <- genes(ensDb)
counts <- read_tsv("../2_alignedData/featureCounts/genes.out") %>%
set_names(basename(colnames(.))) %>%
set_names(str_remove_all(colnames(.), "Aligned.sortedByCoord.out.bam"))
Total counts from each library after assigning to genes
minCpm <- 1
minSamples <- 5
dge <- counts %>%
gather(key = "Library", value = "Counts", -Geneid) %>%
mutate(Library = str_replace_all(Library, "__-", "_WT"),
Sample = case_when(
str_count(Library, "_") == 2 ~ paste0(Library, "_1"),
str_count(Library, "_") == 3 ~ Library
),
Sample = str_remove_all(Sample, "_[12]$")) %>%
dplyr::filter(Counts > 0) %>%
group_by(Geneid, Sample) %>%
summarise(Counts = sum(Counts)) %>%
spread(key = Sample, value = Counts, fill = 0) %>%
as.data.frame() %>%
column_to_rownames("Geneid") %>%
magrittr::extract(rowSums(cpm(.) > minCpm) >= minSamples,) %>%
DGEList(genes = genes[rownames(.)]) %>%
calcNormFactors()
Extracted RNA from each individual zebrafish was split and run across two NextSeq lanes, giving technical replicates for the sequencing step only. These were combined to provide a single library for each sample, as technical replication at this experimental step is not of particular interest. Genes were retained for analysis if a CPM > 1 was observed for \(\geq\) 5 samples. This equated to about 31 reads for a gene in at least 5 samples for inclusion in downstream analysis, giving a total of 17,962 of the original 32,057 genes for DGE analysis.
dge$samples %<>%
mutate(Sample = rownames(.),
Genotype = str_extract(Sample, "(WT|FAD|FS)"),
Genotype = factor(Genotype, levels = c("WT", "FS", "FAD")),
group = as.integer(Genotype)) %>%
set_rownames(.$Sample)
v <- voomWithQualityWeights(dge, design = matrix(rep(1, ncol(dge)), ncol = 1))
Counts were also processed using the voom transformation using quality weights to allow for analysis using normal-based algorithms. Samples weights ranged between 0.3878 and 1.427, with the most down-weigted sample being a WT sample.
The first step for the analysis was to perform an MDS analysis. However, minimal separation was observed between sample groups, with the exception of the most strongly down-weighted sample, which separated very clearly from the remainder of the points. This sample may be a candidate for exclusion if no differentially expressed genes are able to be detected. A simple PCA also revealed that the first few principal components only capture a lesser amount of the total variability than may be expected.
mds <- plotMDS(v, plot = FALSE) %>%
extract2("cmdscale.out") %>%
set_colnames(paste0("Dim", 1:2))
MDS plot showing no clear groups within the data. Point sizes indicate sample weights as calculated by voomWithQualityWeights().
| PC1 | PC2 | PC3 | PC4 | PC5 | |
|---|---|---|---|---|---|
| Standard deviation | 22.19 | 20.69 | 15.66 | 14.61 | 13.31 |
| Proportion of Variance | 0.2031 | 0.1765 | 0.1011 | 0.08807 | 0.07307 |
| Cumulative Proportion | 0.2031 | 0.3796 | 0.4807 | 0.5688 | 0.6419 |
In order to confirm the genotypes of all libraries, three seed sequences were searched in the trimmed reads (i.e. before alignment) which represented the WT, FAD or FS genotypes. The number of matches to each of the seeds was counted for each sample as a simple strategy for confirming genotypes.
| Genotype | Seed |
|---|---|
| WT | GTGCTCAACACTCTGGTC |
| FAD | AACATGATCAGTCTGATC |
| FS | TCGGTGCTCTGGTCATGA |
| Sample | WT | FAD | FS | Genotype | Observed | Confirmed |
|---|---|---|---|---|---|---|
| 1_FAD_5 | 4 | 4 | 0 | FAD | FAD | TRUE |
| 10_FS_1 | 6 | 0 | 4 | FS | FS | TRUE |
| 11_WT_5 | 20 | 0 | 0 | WT | WT | TRUE |
| 12_WT_4 | 17 | 0 | 0 | WT | WT | TRUE |
| 13_WT_3 | 23 | 0 | 0 | WT | WT | TRUE |
| 14_WT_2 | 24 | 0 | 0 | WT | WT | TRUE |
| 15_WT_1 | 17 | 0 | 0 | WT | WT | TRUE |
| 2_FAD_4 | 9 | 11 | 0 | FAD | FAD | TRUE |
| 3_FAD_3 | 7 | 10 | 0 | FAD | FAD | TRUE |
| 4_FAD_2 | 12 | 7 | 0 | FAD | FAD | TRUE |
| 5_FS_2 | 13 | 0 | 2 | FS | FS | TRUE |
| 6_FAD_1 | 8 | 11 | 0 | FAD | FAD | TRUE |
| 7_FS_5 | 7 | 0 | 1 | FS | FS | TRUE |
| 8_FS_4 | 0 | 12 | 0 | FS | FAD | FALSE |
| 9_FS_3 | 9 | 0 | 1 | FS | FS | TRUE |
One sample (8_FS_4) was labelled as an FS sample but presented with the contradictory FAD genotype and this was investgated manually. Notably, no WT or FS reads were found in this sample using this simple strategy.
dge$samples["8_FS_4","Genotype"] <- "FAD"
v$targets["8_FS_4","Genotype"] <- "FAD"
mm <- model.matrix(~0 + Genotype, data = v$targets) %>%
set_colnames(gsub("Genotype", "", colnames(.)))
cm <- makeContrasts(FSvWT = FS - WT,
FADvWT = FAD - WT,
FADvFS = FAD - FS,
levels = colnames(mm))
Three comparisons were defined with the first two being the difference between the two mutant families and the wild-type samples. The third comparison was defined as being between the two mutant groups.
vFit <- v %>%
lmFit(mm) %>%
contrasts.fit(cm) %>%
eBayes()
colnames(cm) %>%
sapply(function(x){
topTable(vFit, x) %>%
mutate(Comparison = x)
}, simplify = FALSE)